Fine-Grained Document Genre Classification Using First Order Random Graphs

Authors

  • Andrew D. Bagdanov
  • Marcel Worring
Abstract

We approach the general problem of classifying machine-printed documents into genres. Layout is a critical factor in recognizing fine-grained genres, because the content features of documents in such genres are similar. Document genre is determined from the layout structure detected in scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structures. Our method uses attributed relational graphs (ARGs) to represent the layout structure of document instances, and first order random graphs (FORGs) to represent document genres. In this paper we develop our FORG-based genre classification method and present a comparative evaluation of our technique against a variety of statistical pattern classifiers. FORGs are capable of modeling common layout structure within a document genre and are shown to significantly outperform traditional pattern classification techniques when fine-grained genre distinctions must be drawn.
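As a rough illustration of the kind of representation the abstract describes, the following minimal sketch builds an attributed relational graph for one page layout. The attribute choices (normalized bounding boxes on vertices, relative offsets on edges) and all names such as `Zone` and `LayoutARG` are assumptions made for this example; the abstract does not specify the authors' actual attributes or data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    """A layout region (graph vertex). Attributes are hypothetical:
    a bounding box in page-normalized coordinates."""
    x: float
    y: float
    width: float
    height: float


@dataclass
class LayoutARG:
    """Attributed relational graph: attributed vertices plus attributed edges."""
    zones: dict[str, Zone] = field(default_factory=dict)
    relations: dict[tuple[str, str], dict[str, float]] = field(default_factory=dict)

    def add_zone(self, name: str, zone: Zone) -> None:
        self.zones[name] = zone

    def relate(self, a: str, b: str) -> None:
        # Edge attributes (assumed): offset of zone b relative to zone a.
        za, zb = self.zones[a], self.zones[b]
        self.relations[(a, b)] = {"dx": zb.x - za.x, "dy": zb.y - za.y}


# One document instance: a title zone above a body zone.
page = LayoutARG()
page.add_zone("title", Zone(0.1, 0.05, 0.8, 0.08))
page.add_zone("body", Zone(0.1, 0.20, 0.8, 0.70))
page.relate("title", "body")
```

A genre model in the spirit of a first order random graph would then describe distributions over such vertex and edge attributes across many instances of a genre; that part is not sketched here.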


Similar resources

Content-free Document Genre Classification using First Order Random Graphs

We approach the general problem of machine-printed document genre classification using content-free layout structure analysis. Document genre is determined from the layout structure detected from scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structures. Our approach uses attributed relational graphs (ARGs) to represent the lay...


Label Propagation for Fine-Grained Cross-Lingual Genre Classification

Cross-lingual methods can bring the benefits of genre classification to languages which lack genre-annotated training data. However, prior work in this field has been evaluated on coarse genres only. To predict fine-grained genres across languages, we propose a label propagation method, which combines separate sets of features. The results are promising, as the approach outperforms most baselin...


Enhanced Genre Classification through Linguistically Fine-Grained POS Tags

We propose the use of fine-grained part-of-speech (POS) tags as discriminatory attributes for automatic genre classification and report empirical results from an experiment that indicate substantial accuracy gain by such features over the conventional bag-of-words approach through word unigrams. In particular, this paper reports our research to investigate the performance of a fine-grained tag ...


Fine-Grained Sentiment Analysis for Movie Reviews in Bulgarian

We present a system for fine-grained sentiment analysis in Bulgarian movie reviews. As this is pioneering work for this combination of language and sentiment granularity, we create suitable, freely available resources: a dataset of movie reviews with fine-grained scores, and a sentiment polarity lexicon. We further compare experimentally the performance of classification, regression and ordinal...


Discovering Fine-Grained Sentiment with Latent Variable Structured Prediction Models

In this paper we investigate the use of latent variable structured prediction models for fine-grained sentiment analysis in the common situation where only coarse-grained supervision is available. Specifically, we show how sentence-level sentiment labels can be effectively learned from document-level supervision using hidden conditional random fields (HCRFs) [10]. Experiments show that this tech...




Publication date: 2001